BMC Medical Genomics — Latest Matching Preprints

1

Three Steps Novel Hard Margin Ensemble Machine Learning Method Classifies Uncertain Mefv Gene Variants

Alay, M. T.; Demir, I.; Kirisci, M.

2023-04-18 genetic and genomic medicine 10.1101/2023.04.08.23288306 medRxiv

Top 0.1%

23.0%

Introduction: The International Study Group for Systemic Autoinflammatory Diseases (INSAID) consensus criteria revealed that the clinical outcomes of more than half of the MEFV gene variants are uncertain. We aimed to classify MEFV variants more accurately while simultaneously reducing MEFV variant uncertainty. Materials and Methods: We extracted variants of the MEFV gene from the Infevers database. We then determined the optimal number of in silico instruments for our model. On the training dataset, we implemented seven machine learning algorithms on MEFV gene variants with known clinical effects. We evaluated the effectiveness of our model in three steps: First, we ran the machine-learning algorithms on the training dataset and retained those with a prediction accuracy greater than 90 percent. Second, we compared our gene-level and protein-level prediction results. Finally, we compared our prediction results to clinical outcomes. Results: Our analysis included 266 of 381 MEFV gene variants and four computational tools (REVEL, SIFT, MetaLR, and FATHMM). In our training dataset, the accuracy of three machine learning algorithms (RF: 100%, CRAT: 100%, and KNN: 91%) exceeded the threshold value. Thus, the dataset contained 134 likely pathogenic (LP) variants and 132 likely benign (LB) variants. We found that B30.2 domain variants were 2.5 times more likely to be LP than LB (χ² = 12.693, p < 0.001, OR: 2.595 [1.532-4.132]). Discussion: Considering that the clinical effects of 60% of MEFV gene variants have not yet been determined, a combined evaluation of our methods and patients' clinical manifestations significantly simplifies the interpretation of unknown variants.

2

Integration of immunome with disease gene network reveals pleiotropy and novel drug repurposing targets

Devaprasad, A.; Radstake, T. R.; Pandit, A.

2019-12-13 bioinformatics 10.1101/2019.12.12.874321 medRxiv

Top 0.1%

20.0%

Objective: Development and progression of immune-mediated inflammatory diseases (IMIDs) involve intricate dysregulation of the disease associated genes (DAGs) and their expressing immune cells. Due to the complex molecular mechanism, identifying the top disease associated cells (DACs) in IMIDs has been challenging. Here, we aim to identify the top DACs and DAGs to help understand the cellular mechanism involved in IMIDs and further explore therapeutic strategies. Method: Using transcriptome profiles of 40 different immune cells, unsupervised machine learning, and disease-gene networks, we constructed the Disease-gene IMmune cell Expression (DIME) network, and identified top DACs and DAGs of 12 phenotypically different IMIDs. We compared the DIME networks of IMIDs to identify common pathways between them. We used the common pathways and a publicly available drug-gene network to identify promising drug repurposing targets. Result: We found CD4+Treg, CD4+Th1, and NK cells as top DACs in inflammatory arthritis such as ankylosing spondylitis (AS), psoriatic arthritis, and rheumatoid arthritis (RA); neutrophils, granulocytes and BDCA1+CD14+ cells in systemic lupus erythematosus and systemic scleroderma; and ILC2, CD4+Th1, CD4+Treg, and NK cells in the inflammatory bowel diseases (IBDs). We identified lymphoid cells (CD4+Th1, CD4+Treg, and NK) and their associated pathways to be important in HLA-B27 type diseases (psoriasis, AS, and IBDs) and in primary-joint-inflammation-based inflammatory arthritis (AS and RA). Based on the common cellular mechanisms, we identified lifitegrast as a potential drug repurposing candidate for Crohn's disease and other IMIDs. Conclusion: Our method identified top DACs, DAGs, and common pathways, and proposed potential drug repurposing targets between IMIDs. To extend our method to other diseases, we built the DIME tool, thus paving the way for future (pre-)clinical research.

3

Single-cell RNA-seq analysis of human coronary arteries using an enhanced workflow reveals SMC transitions and candidate drug targets

Ma, W. F.; Hodonsky, C. J.; Turner, A. W.; Wong, D.; Song, Y.; Barrientos, N. B.; Miller, C. L.

2020-10-27 genomics 10.1101/2020.10.27.357715 medRxiv

Top 0.1%

19.2%

Background and Aims: The atherosclerotic plaque microenvironment is highly complex, and selective agents that modulate plaque stability or other plaque phenotypes are not yet available. We sought to investigate the human atherosclerotic cellular environment using scRNA-seq to uncover potential therapeutic approaches. We aimed to make our workflow user-friendly, reproducible, and applicable to other disease-specific scRNA-seq datasets. Methods: Here we incorporate automated cell labeling, pseudotemporal ordering, ligand-receptor evaluation, and drug-gene interaction analysis into an enhanced and reproducible scRNA-seq analysis workflow. Notably, we also developed an R Shiny based interactive web application to enable further exploration and analysis of the scRNA dataset. Results: We applied this analysis workflow to a human coronary artery scRNA dataset and revealed distinct derivations of chondrocyte-like and fibroblast-like cells from smooth muscle cells (SMCs), and showed the key changes in gene expression along their de-differentiation path. We highlighted several key ligand-receptor interactions within the atherosclerotic environment through functional expression profiling and revealed several attractive avenues for future pharmacological repurposing in precision medicine. Further, our interactive web application, PlaqView (www.plaqview.com), allows other researchers to easily explore this dataset and benchmark applicable scRNA-seq analysis tools without prior coding knowledge. Conclusions: These results suggest novel effects of chemotherapeutics on the atherosclerotic cellular environment and provide future avenues of study in precision medicine. This publicly available workflow will also allow for more systematic and user-friendly analysis of scRNA datasets in other disease and developmental systems. PlaqView allows for rapid visualization and analysis of atherosclerosis scRNA-seq datasets without the need for prior coding experience. Future releases of PlaqView will feature additional larger scRNA-seq and scATAC-seq atherosclerosis-related datasets, thus providing a critical resource for the field by promoting data harmonization and biological interpretation.

4

Systematic GWAS-assessment of disease modules reveals a multi-omic MS module strongly associated with risk factors

Badam, T. V. S.; de Weerd, H. A.; Martinez-Enguita, D.; Olsson, T.; Alfredsson, L.; Kockum, I.; Jagodic, M.; Lubovac-Pilav, Z.; Gustafsson, M.

2020-10-26 bioinformatics 10.1101/2020.10.26.351783 medRxiv

Top 0.1%

18.6%

Background: There are few (if any) practical guidelines for predictive and falsifiable multi-omics data integration that systematically integrate existing knowledge. Disease modules are popular concepts for interpreting genome-wide studies in medicine but have so far not been systematically evaluated and may lead to corroborating multi-omic modules. Methods: We assessed eight module identification methods in 57 previously published expression and methylation studies of 19 diseases using GWAS enrichment analysis. Next, we applied the same strategy for multi-omics integration of 19 datasets of multiple sclerosis (MS), and further validated the resulting module using both GWAS and risk-factor associated genes from several independent cohorts. Results: Our benchmark of modules showed that in immune-associated diseases, modules inferred from clique-based methods were the most enriched for GWAS genes. The multi-omics case study using MS revealed the robust identification of a module of 220 genes. Strikingly, most genes of the module were differentially methylated upon the action of one or several environmental risk factors in MS (n = 217, P = 10⁻⁴⁷) and were also independently validated for association with five different risk factors of MS, which further stressed the high genetic and epigenetic relevance of the module for MS. Conclusion: We believe our analysis provides a workflow for selecting modules and our benchmark study may help further improvement of disease module methods. Moreover, we also stress that our methodology is generally applicable for combining and assessing the performance of multi-omics approaches for complex diseases.

5

Genome-wide Association Clustering Meta-analysis in European and Chinese Datasets for Systemic Lupus Erythematosus identifies new genes

Saeed, M.

2023-07-08 genetic and genomic medicine 10.1101/2023.07.07.23292357 medRxiv

Top 0.1%

17.7%

Genome-wide association studies (GWAS) face multiple challenges in identifying reliable susceptibility genes for complex disorders such as systemic lupus erythematosus (SLE). These include high false positivity due to the number of SNPs genotyped, false negativity due to statistical corrections, and the proportional signals problem. Association clustering methods, by reducing the testing burden, have greater power than single variant analysis. Here, OASIS, a locus-based test, and its novel statistic, the OASIS locus index (OLI), are applied to European (EU) and Chinese (Chi) SLE GWAS to identify common significant non-HLA, autosomal genes. Six SLE dbGAP GWAS datasets, 4 EU and 2 Chi, involving 19,710 SLE cases and 30,876 controls were analyzed. OLI is defined as the product of the maximum -logP at a locus and the ratio of the actual to the predicted number of significant SNPs, and was compared against the standard P-value using box plots and the Wilcoxon Signed Rank Test. OLI outperformed the standard P-value statistic in detecting true associations (Wilcoxon Signed Rank Test Z = -4.11, P < 1 × 10⁻⁴). The top non-HLA significant loci common to both ethnicities were 2q32.2 (STAT4, rs4274624, P = 9.7 × 10⁻⁶⁶), 1q25.3 (SMG7, rs41272536, P = 3.5 × 10⁻⁵²), 7q32.1 (IRF5, rs35000415, P = 1.9 × 10⁻⁴⁵), 8p23.1 (BLK, rs2736345, P = 1.5 × 10⁻²⁵) and 6q23.3 (TNFAIP3, rs5029937, P = 4.4 × 10⁻²⁴). Overall, OASIS identified 19 highly significant and 16 modestly significant (P > 10⁻⁸) non-HLA SLE associated genes common to EU and Chi ethnicities. Interaction of these 35 genes elucidated important SLE pathways, viz. NOD, TLR, JAK-STAT and RIG-1. OASIS aims to advance GWAS by rapid and cost-effective identification of genes of modest significance for complex disorders.
Key Messages: GWAS are challenged by risk genes of modest effect. OASIS, a clustering algorithm, can help identify genes of modest significance for complex disorders such as SLE, rapidly and cost-effectively, using publicly available GWAS datasets. This meta-analysis identified 35 genes common to both European and Chinese populations. Interaction of these genes identifies the major SLE pathways as NOD, TLR, JAK-STAT and RIG-1.
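As a reading aid, the verbal definition of the OLI in the abstract above can be written as a formula. This is only a sketch of that one sentence: the symbols P_i, N_actual and N_predicted are notation introduced here, and the logarithm base and the way the predicted count is obtained are not stated in the abstract.

```latex
\mathrm{OLI}_{\text{locus}}
  = \max_{i \,\in\, \text{locus}} \bigl(-\log P_i\bigr)
    \times \frac{N_{\text{actual}}}{N_{\text{predicted}}}
```

Here P_i is the association P-value of SNP i at the locus, and N_actual and N_predicted are the actual and predicted numbers of significant SNPs at that locus.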

6

Moving from GWAS signals to rare functional variation in inflammatory bowel disease through application of GenePy2 as a potential DNA biomarker

Cheng, G.; Ashton, J.; Collins, A.; Bettie, M.; Ennis, S.

2024-04-20 genetic and genomic medicine 10.1101/2024.04.19.24306093 medRxiv

Top 0.1%

15.1%

Objectives: We adopt a weighted variant burden score, GenePy2.0, for the UK Biobank phase 2 cohort of inflammatory bowel disease (IBD), to explore potential genomic biomarkers underpinning IBD's known associations. Design: Nucleating from IBD GWAS signals, we identified 794 GWAS loci, including target genes/LD-blocks (LDBs) based on linkage disequilibrium (LD) and functional mapping. We calculated GenePy2.0, a burden score of target regions integrating variants with CADD Phred > 15 weighted by deleteriousness and zygosity. Alongside other burden-based tests, GenePy-based Mann-Whitney U tests on cases/controls with varying extreme scores were used. Significance levels and effect sizes were used for tuning the optimal GenePy thresholds for discriminating patients from controls. Individuals' binarized GenePy status (above or below threshold) for candidate regions was subjected to an itemset association test via the sparse Apriori algorithm. Results: A tailored IBD cohort was curated (n = 891 Crohn's disease (CD), n = 1409 ulcerative colitis (UC), n = 60118 controls). Analysing 885 unified target regions (794 GWAS loci and 104 monogenic genes with 13 overlaps), the GenePy approach detected statistical significance (permutation p < 5.65 × 10⁻⁵) in 35 regions of CD and 25 of UC targets exerting risk and protective effects on the disease. Large effect sizes were observed, e.g. CYLD-AS1 (Mann-Whitney effect size = 0.89 [CI: 0.78-0.96]) in CD/controls with the top 1% highest scores of the gene. Itemset association learning further highlighted an intriguing signal whereby GenePy status of IL23R and NOD2 were mutually exclusive in CD but always co-occurring in controls. Conclusion: The GenePy score per IBD patient detected deleterious variation of large effect underpinning known IBD associations and proved itself a promising tool for genomic biomarker discovery.
What is already known on this topic: Inflammatory bowel disease (IBD) is a genetically heterogeneous disease with both common polygenic and rare monogenic presentations. Previous studies have identified known genetic variants associated with disease. What this study adds: A genomic biomarker tool tailored for large cohorts, GenePy2.0, is developed. Its rank-based test is more powerful than a mutation-burden based test in validating known associations and finding new associations of IBD. We identified large risk and protective effects of pathogenic genes/loci in IBD, including expanding previous associations to wider genomic regions. How this study might affect research, practice or policy: GenePy2.0 facilitates analysis of diseases with genetic heterogeneity and facilitates personalised genomic analysis on patients. The revealed genetic landscape of IBD captures both risk and protective effects of rare pathogenic variants, alongside more common variation. This could provide a fresh angle for future targeted therapies in specific groups of patients.

7

Imbalanced expression for predicted high-impact, autosomal-dominant variants in a cohort of 3,818 healthy samples

de Klein, N.; van Dijk, F.; Deelen, P.; Urzua, C. G.; Claringbould, A.; Vosa, U.; Verlouw, J. A. M.; Monajemi, R.; 't Hoen, P. A. C.; Sinke, R. J.; BIOS Consortium, ; Swertz, M. A.; Franke, L.

2020-09-20 genomics 10.1101/2020.09.19.300095 medRxiv

Top 0.1%

15.0%

Background: One of the growing problems in genome diagnostics is the increasing number of variants that are identified through genetic testing but whose significance for the disease is unknown (variants of unknown significance, VUS) [1,2]. When these variants are observed in patients, clinicians need to be able to determine their relevance for causing the patient's disease. Here we investigated whether allele-specific expression (ASE) can be used to prioritize disease-relevant VUS and therefore assist diagnostics. To do so, we conducted ASE analysis in RNA-seq data from 3,818 blood samples (part of the Dutch BIOS biobank consortium) to ascertain how VUS affect gene expression. We compared the effect of VUS to variants that are predicted to have a high impact, and variants that are predicted to be pathogenic but are either recessive or autosomal-dominant with low penetrance. Results: For immune and haematological disorders, we observed that 24.7% of known pathogenic variants from ClinVar show allelic imbalance in blood, as compared to 6.6% of known benign variants with matching allele frequencies. However, for other types of disorders, ASE information from blood did not distinguish (likely) pathogenic variants from benign variants. Unexpectedly, we identified 5 genes (ALOX5, COMT, PRPF8, PSTPIP1 and SH3BP2) in which seven population-based samples had a predicted high-impact, autosomal-dominant variant. For these genes the imbalanced expression of the major allele compensates for the lower expression of the minor allele. Conclusions: Our analysis in a large population-based gene expression cohort reveals examples of high-impact, autosomal-dominant variants that are compensated for by imbalanced expression. Additionally, we observed that ASE analyses in blood are informative for predicting pathogenic variants that are associated with immune and haematological conditions. 
We have made all our ASE results, including many ASE calls for rare variants (MAF < 1%), available at https://molgenis15.gcc.rug.nl/.

8

SCIA: A fast and widely applicable pipeline for measuring expanded repeat instability

Smith, C.; Peter Durairaj, R. R.; Randall, E. L.; Aston, A. N.; Heraty, L.; Elsayed, W.; Murillo, A.; Dion, V.

2026-03-15 neuroscience 10.64898/2026.03.12.707943 medRxiv

Top 0.1%

14.9%

The expansion of short tandem repeats is a feature of over 60 different human diseases. Ongoing somatic instability throughout a patient's lifetime can influence disease progression and has emerged as a therapeutic target. Understanding its mechanism is essential for the identification of both drug targets and therapeutic interventions. A major obstacle towards this translational goal has been to measure changes in repeat size distribution in a timely manner. To address this, here we present the Single Clone-based Instability Assay (SCIA), a streamlined experimental design that saves weeks in assessing the effect of a gene knockout on repeat instability. The approach avoids bulk cultures and does not require a reporter cell line. It uses targeted long-read sequencing as a readout for repeat instability. We have validated the approach using FAN1, PMS1, and MLH1 knockouts in HEK293-derived cells. We provide visualization software that generates delta plots and extracts the instability frequency, the bias towards expansion or contraction, and the average size of the changes. Using SCIA, we find that although FAN1 knockout clones showed an increased frequency of expansions, the size of the expansions was smaller. This highlights the wealth of information that can be extracted and the potential for novel insights into the mechanism of repeat instability.

9

Proteomic Fingerprinting: A novel privacy concern

Hill, A. C.; Litkowski, E. M.; Manichaikul, A.; Lange, L.; Pratte, K. A.; Kechris, K. J.; DeCamp, M.; Coors, M.; Ortega, V. E.; Rich, S. S.; Rotter, J. I.; Gerzsten, R. E.; Clish, C. B.; Curtis, J.; Hu, X.; Ngo, D.; ONeal, W. K.; Meyers, D.; Bleecker, E.; Hobbs, B. D.; Cho, M. H.; Banaei-kashani, F.; Bowler, R. P.

2022-04-12 genetic and genomic medicine 10.1101/2022.04.06.22269907 medRxiv

Top 0.1%

14.6%

Introduction: Privacy protection is a core principle of genomic research but needs further refinement for high-throughput proteomic platforms. Methods: We identified independent single nucleotide polymorphism (SNP) protein quantitative trait loci (pQTL) from COPDGene and Jackson Heart Study (JHS) and then calculated genotype probabilities by protein level for each protein-genotype combination (training). Using the most significant 100 proteins, we applied a naive Bayesian approach to match proteomes to genomes for 2,812 independent subjects from COPDGene, JHS, SubPopulations and InteRmediate Outcome Measures In COPD Study (SPIROMICS) and Multi-Ethnic Study of Atherosclerosis (MESA) with SomaScan 1.3K proteomes and also 2,646 COPDGene subjects with SomaScan 5K proteomes (testing). We tested whether subtracting the mean genotype effect for each pQTL SNP would obscure genetic identity. Results: In the four testing cohorts, we were able to match 90%-95% of proteomes to their correct genome, and for 95%-99% we could match the proteome to the 1% most likely genomes. With larger profiling (SomaScan 5K), correct identification was > 99%. The accuracy of matching in subjects with African ancestry was lower (~60%) unless training included diverse subjects. Mean genotype effect adjustment reduced identification accuracy nearly to random guessing. Conclusion: Large proteomic datasets (> 1,000 proteins) can be accurately linked to a specific genome through pQTL knowledge and should not be considered deidentified. These findings suggest that large scale proteomic data be given the privacy protections of genomic data, or that bioinformatic transformations (such as adjustment for genotype effect) should be applied to obfuscate identity.
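The naive Bayesian matching step described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name, the discretised protein-level bins, the probability table and the toy data are all hypothetical, and the sketch only assumes that each protein contributes an independent term P(genotype | protein level) to the score of a candidate genome.

```python
import math

def match_proteome_to_genome(protein_levels, genomes, genotype_prob):
    """Return the id of the candidate genome with the highest naive Bayes score.

    protein_levels: dict protein -> level bin (hypothetical discretisation)
    genomes:        dict genome_id -> dict protein -> genotype (0, 1 or 2)
                    at that protein's pQTL SNP
    genotype_prob:  dict (protein, level bin) -> dict genotype -> probability,
                    i.e. the table learned in the training step
    """
    best_id, best_score = None, -math.inf
    for genome_id, genotypes in genomes.items():
        score = 0.0
        for protein, level in protein_levels.items():
            p = genotype_prob[(protein, level)].get(genotypes[protein], 0.0)
            # Sum log-probabilities; floor at a tiny value to avoid log(0).
            score += math.log(max(p, 1e-12))
        if score > best_score:
            best_id, best_score = genome_id, score
    return best_id

# Made-up toy data: two proteins, two candidate genomes.
toy_prob = {
    ("P1", "high"): {0: 0.1, 1: 0.3, 2: 0.6},
    ("P2", "low"): {0: 0.7, 1: 0.2, 2: 0.1},
}
toy_genomes = {"A": {"P1": 2, "P2": 0}, "B": {"P1": 0, "P2": 2}}
toy_levels = {"P1": "high", "P2": "low"}
best = match_proteome_to_genome(toy_levels, toy_genomes, toy_prob)
```

In this toy case genome "A" wins because both of its genotypes are the most probable ones given the observed protein levels. Working in log space keeps the product of many small probabilities numerically stable when on the order of 100 proteins are used.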

10

Identification and functional characterization of an AMD associated c-ABL binding SNP streak within the ARMS2 gene promoter region

Zhang, P.-W.; Liu, S.; Li, W.; Fan, L.; Li, S.; Wan, Z.-H.; Berlinicke, C. A.; Merbs, S. L.; Zack, D. J.

2025-10-27 molecular biology 10.1101/2025.10.27.684937 medRxiv

Top 0.1%

14.6%

Background: Large-scale genome-wide association studies (GWAS) have identified the human 10q26 locus as a major genetic risk factor for age-related macular degeneration (AMD). The AMD-associated interval has been refined to a 5,196 bp segment flanking the ARMS2-HTRA1 region, excluding HTRA1 and the ARMS2 3' indel (443del54ins) variant by risk haplotype analysis. Although the missense SNP rs10490924 has been proposed as a functional variant, its role in AMD remains controversial, and the causative variants and underlying mechanisms within this region remain unresolved. Methods: An unbiased bioinformatic screen identified a 5-SNP block within the 5,196 bp interval that potentially alters c-ABL protein binding. Protein-DNA interactions were validated using electrophoretic mobility shift assay (EMSA) and chromatin immunoprecipitation (ChIP) assays. Genetic association with AMD (dry and wet subtypes) was assessed in patient cohorts using blood genomic DNA. The regulatory effect of the 5-SNP block was further examined using luciferase reporter assays. Findings: We identified a 5-SNP block located ~556 bp upstream of the ARMS2 start codon, representing a cluster of predicted c-ABL tyrosine kinase binding sites. This block, in complete linkage disequilibrium with rs10490924 (A69S), showed a strong association with both wet and dry AMD (136 controls, 179 dry AMD, 251 wet AMD). EMSA and ChIP confirmed direct c-ABL binding, while luciferase reporter assays demonstrated reduced transcriptional activity mediated by the 5-SNP block in the presence of c-ABL. Interpretation: Our results suggest that the c-ABL-responsive 5-SNP regulatory streak in the ARMS2 promoter region acts as a functional non-coding element that may contribute to AMD pathogenesis through altered transcriptional regulation.

11

Benchmark of Wide Range of Pairwise Distance Metrics for Automated Classification of Mouse Mutant Phenotypes from Flow Cytometry Data

May, M.; Hewitt, T.; Mashford, B.; Hammill, D.; Davies, A.; Andrews, T. D.

2025-01-06 bioinformatics 10.1101/2025.01.06.631468 medRxiv

Top 0.1%

14.5%

Precision medicine requires a comprehensive mapping of genotype to phenotype to provide patients with individually tailored treatment. However, when using flow cytometry to identify phenotypes, such as the quantity of various immune cell populations in tissue and blood used to identify autoimmune disorders, it is often unclear which cellular phenotypes are from healthy and which from diseased individuals, especially when including the effects of population diversity, due to the high-dimensional nature of the data. To identify and segregate healthy phenotypes from various disease phenotypes, we use pairwise distance metrics between each sample's cell populations. By comparing distance metrics between C57BL/6 clone mice with mutations of known phenotype, we find that cosine similarity is best suited for segregating wildtype from mutant samples while respecting minute differences in already small cell populations, and that standardised Euclidean distance is best suited for machine-learning input due to its sensitivity. Both metrics outperform the other tested metrics (including Aitchison, Euclidean, Manhattan, Earth Mover's Distance, and squared Euclidean). We demonstrate the utility of these different pairwise metrics through their application to a classification task of known mutant phenotypes, using an existing FACS phenotype dataset derived from X000 inbred C57BL/6 mice that harbour potentially phenotypic genetic variation introduced through ENU mutagenesis of individual pedigree-founding G0 male mice.
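For readers unfamiliar with the two metrics singled out in the abstract above, the following Python sketch shows their textbook definitions. It is only an illustration: the function names and the idea of passing per-feature variances explicitly are choices made here, and nothing about the authors' preprocessing of cell-population data is implied.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def standardized_euclidean(a, b, variances):
    """Euclidean distance with each squared difference divided by that feature's variance."""
    return math.sqrt(sum((x - y) ** 2 / v for x, y, v in zip(a, b, variances)))
```

Cosine similarity ignores overall magnitude and compares only the relative composition of the two vectors, whereas the standardised Euclidean distance rescales each feature by its variance so that small but low-variance populations are not swamped by large ones.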

12

Integrative Multi-Omics Framework for Causal Gene Discovery in Long COVID

Le, T. D.; Pinero, S. L.; Li, X.; Liu, L.; Li, J.; Lee, S. H.; Winter, M.; Nguyen, T.; Zhang, J.

2025-02-12 genetic and genomic medicine 10.1101/2025.02.09.25321751 medRxiv

Top 0.1%

14.5%

Background: Long COVID, or Post-Acute Sequelae of COVID-19 (PASC), involves persistent, multisystemic symptoms in about 10-20% of COVID-19 patients. Although age, sex, ethnicity, and comorbidities are recognized as risk factors, identifying genetic contributors is essential for developing targeted therapies. Methods: We developed a multi-omics framework using Transcriptome-Wide Mendelian Randomization (TWMR) and Control Theory (CT). This approach integrates Expression Quantitative Trait Loci (eQTL), Genome-Wide Association Studies (GWAS), RNA sequencing (RNA-seq), and Protein-Protein Interaction (PPI) networks to detect causal genes and regulatory nodes that drive critical expression changes in Long COVID. Results: We identified 32 causal genes (19 previously reported and 13 novel), which act as regulatory drivers influencing disease risk, progression, and stability. Enrichment analyses highlighted pathways linked to the SARS-CoV-2 response, viral carcinogenesis, cell cycle regulation, and immune function. Analysis of other pathophysiological conditions revealed shared genetic factors across syndromic, metabolic, autoimmune, and connective tissue disorders. Using these genes, we identified three distinct symptom-based subtypes of Long COVID, offering insights for more precise diagnosis and potential therapeutic interventions. Additionally, we provided an open-source Shiny application to enable further data exploration. Conclusion: Integrating TWMR and CT revealed genetic mechanisms and therapeutic targets for Long COVID, with novel genes informing pathogenesis and precision medicine strategies.

13

Spatial Distribution of Missense Variants within Complement Proteins Associates with Age Related Macular Degeneration

Grunin, M.; de Jong, S.; Palmer, E. L.; Jin, B.; Rinker, D.; Moth, C.; Capra, J. A.; Haines, J. L.; Bush, W.; den Hollander, A.; International Age-related Macular Degeneration Genomics Consortium,

2023-08-31 genetic and genomic medicine 10.1101/2023.08.28.23294686 medRxiv

Top 0.1%

14.4%

Purpose: Genetic variants in complement genes are associated with age-related macular degeneration (AMD). However, many rare variants identified in these genes are of unknown significance, and their impact on protein function and structure is still unknown. We set out to address this issue by evaluating the spatial placement of these variants and their impact on protein structure, developing an analytical pipeline and applying it to the International AMD Genomics Consortium (IAMDGC) dataset (16,144 AMD cases, 17,832 controls). Methods: The IAMDGC dataset was imputed using the Haplotype Reference Consortium (HRC), yielding over 30% more imputed variants than the original 1000 Genomes imputation. Variants were extracted for the CFH, CFI, CFB, C9, and C3 genes, and filtered for missense variants in solved protein structures. We evaluated these variants as to their placement in the three-dimensional structure of the protein (i.e. spatial proximity in the protein), as well as AMD association. We applied several pipelines to a) calculate spatial proximity to known AMD variants versus gnomAD variants, b) assess a variant's likelihood of causing protein destabilization via calculation of predicted free energy change (ddG) using Rosetta, and c) perform whole gene-based testing for statistical associations. Gene-based testing using seqMeta was performed using a) all variants, b) variants near known AMD variants, or c) variants with |ddG| > 2. Further, we applied a structural kernel adaptation of SKAT testing (POKEMON) to confirm the association of spatial distributions of missense variants with AMD. Finally, we used logistic regression on known AMD variants in CFI to identify variants leading to a >50% reduction in protein expression in known AMD patient carriers of CFI variants compared to wild type (as determined by in vitro experiments) to determine the pipeline's robustness in identifying AMD-relevant variants. 
These results were compared to functional impact scores, i.e. CADD values > 10, which indicate whether a variant may have a large functional impact genome-wide, to determine if our metrics have better discriminative power than existing variant assessment methods. Once our pipeline had been validated, we then performed a priori selection of variants using this pipeline methodology, and tested AMD patient cell lines that carried those selected variants from the EUGENDA cohort (n=34). We investigated complement pathway protein expression in vitro, looking at multiple components of the complement factor pathway in patient carriers of bioinformatically identified variants. Results: Multiple variants were found with |ddG| > 2 in each complement gene investigated. Gene-based tests using known and novel missense variants identified significant associations of the C3, C9, CFB, and CFH genes with AMD risk after controlling for age and sex (P = 3.22 × 10⁻⁵, 7.58 × 10⁻⁶, 2.1 × 10⁻³, and 1.2 × 10⁻³¹, respectively). ddG filtering and SKAT-O tests indicate that missense variants that are predicted to destabilize the protein, in both CFI and CFH, are associated with AMD (CFH: P = 0.05; CFI: P = 0.01; significance threshold 0.05). Our structural kernel approach identified spatial associations for AMD risk within the protein structures for C3, C9, CFB, CFH, and CFI at a nominal p-value of 0.05. Both ddG and CADD scores were predictive of reduced CFI protein expression, with ROC curve analyses indicating ddG is a better predictor (AUCs of 0.76 and 0.69, respectively). A priori in vitro analysis of variants in all complement factor genes indicated that several variants identified by the bioinformatics programs PathProx/POKEMON in our pipeline caused a significant change in complement protein expression (P = 0.04) in actual patient carriers of those variants, as measured by ELISA of proteins in the complement factor pathway; these variants were previously unknown to contribute to AMD pathogenesis. 
Conclusion: We demonstrate for the first time that missense variants in complement genes cluster together spatially and are associated with AMD case/control status. Using this method, we can identify CFI and CFH variants of previously unknown significance that are predicted to destabilize the proteins. These variants, both in and outside spatial clusters, can predict in vitro tested CFI protein expression changes, and we hypothesize the same is true for CFH. A priori identification of variants that impact gene expression allows classification of variants previously classified as VUS. Further investigation is needed to validate the models for additional variants and to apply them to all AMD-associated genes.

14

Dissecting the Cellular Genetics of Cardiovascular Disease Through Endothelial and Immune Compartments Profiling

Watt, S. B.; Ozols, M.; Landini, A.; LIU, B.; Sharapov, S.; Zanotti, D.; Gabriela, I. D.; Solomon, C. U.; Yang, X. D.; Balogun, T.; Staudt, N.; Li, W.; Al-Janabi, F.; Webb, T. R.; McVey, D. G.; Morrell, N. W.; Bennett, M. R.; Samani, N. J.; Gräf, S.; Ye, S.; Pirastu, N.; Soranzo, N.; Frontini, M.

2025-11-12 genomics 10.1101/2025.11.10.687750 medRxiv

Top 0.1%

14.2%

Background: Non-communicable diseases such as coronary artery disease, atrial fibrillation, type 2 diabetes, hypertension, and others share endothelial dysfunction as one of their underlying features. The endothelium, as the interface between blood and vasculature, shapes disease onset and progression through its response to environmental cues. However, while the genetic component of these diseases has been captured by genome-wide association studies (GWAS), which also highlighted a shared immune component, it remains unclear which of these disease loci exert their effects through endothelial cells. This study identifies and quantifies the genetic determinants of endothelial cell molecular traits and their overlap with the common genetic variation component of these diseases. Methods: We generated genotype, RNA-sequencing, H3K27ac ChIP-sequencing, ATAC-sequencing, and endothelial cell barrier stimuli response measurements for 100 samples of human umbilical vein endothelial cells. These were used to identify quantitative trait loci (QTL) for gene expression, transcriptional isoform usage, splice junction usage, chromatin activity, and barrier response. We applied statistical colocalisation to identify the overlap between data layers and to explain the contribution of molecular QTLs to GWAS disease loci. Results: We used molecular QTLs to identify the regulatory features of 8,214 genes, representing 36% of all expressed genes in endothelial cells. We also identified the molecular mechanisms underlying 815 loci across 16 disease GWAS. These represent between 29% and 40% of all loci for each disease, compared to the previous average of 23%. This is due to the choice of a cell type often underrepresented in tissue-level data, and to the inclusion of isoform, splicing, and chromatin activity datasets.
Furthermore, we compared the endothelial cell molecular QTLs with similar datasets in monocytes, neutrophils, and CD4 T lymphocytes to shed light on the interplay between the endothelial and immune compartments in these diseases. We identified loci acting through both the endothelial and the immune compartment, mostly with the same directionality of effect, as well as endothelial-specific ones. Conclusions: This work expands the knowledge of the mechanisms and genes underlying the effect of common genetic variation on non-communicable diseases that have endothelial dysfunction as a shared feature. It also illustrates the interplay between endothelial cells and immune cell types in these diseases, highlighting shared and unique pathways.

15

Multitrait genetic-phenotype associations to connect disease variants and biological mechanisms

Julienne, H.; Laville, V.; McCaw, Z. R.; He, Z.; Guillemot, V.; Lasry, C.; Ziyatdinov, A.; Vaysse, A.; Lechat, P.; Menager, H.; Le Goff, W.; Dube, M.-P.; Kraft, P.; Ionita-Laza, I.; Vilhjalmsson, B. J.; Aschard, H.

2020-10-23 genetics 10.1101/2020.06.26.172999 medRxiv

Top 0.1%

14.1%

Background: Genome-wide association studies (GWAS) have uncovered a wealth of associations between common variants and human phenotypes. These results, widely shared across the scientific community as summary statistics, have fostered a flurry of secondary analyses: heritability and genetic correlation assessment, pleiotropy characterization, and multitrait association tests. Among these secondary analyses, a rising new field is the decomposition of multitrait genetic effects into distinct profiles of pleiotropy. Results: We conducted an integrative analysis of GWAS summary statistics from 36 phenotypes to decipher multitrait genetic architecture and its link to biological mechanisms. We started by benchmarking multitrait association tests on a large panel of phenotype sets and established the Omnibus test as the most powerful in practice. We detected 322 new associations that were not previously reported by univariate screening. Using independent significant associations, we investigated the breakdown of genetic association into clusters of variants harboring similar multitrait association profiles. Focusing on two subsets of immunity and metabolism phenotypes, we then demonstrate how SNPs within clusters can be mapped to biological pathways and disease mechanisms, providing putative insight for numerous SNPs with unknown biological function. Finally, for the metabolism set, we investigate the link between gene cluster assignment and the success of drug targets in randomized controlled trials. We report additional uninvestigated drug targets classified by cluster. Conclusions: Multitrait genetic signals can be decomposed into distinct pleiotropy profiles that are consistent with pathway databases and randomized controlled trials. We propose this method for the mapping of unannotated SNPs to putative pathways.

16

Integrative Mendelian Randomization approaches for therapeutic target prioritisation in immune-mediated diseases

Sobczyk, M. K.; Gaunt, T. R.

2024-04-29 genetic and genomic medicine 10.1101/2024.04.27.24306475 medRxiv

Top 0.1%

12.9%

Background: Immune-mediated diseases (IMD) encompass a wide range of autoimmune and inflammatory disorders with aetiology related to immune system dysfunction, signifying a disease area with great potential for drug repurposing. In this study, we employed the genetically informed Mendelian Randomization (MR) method with two distinct exposure types, immune blood cell abundance and protein quantitative trait loci (pQTL), to validate and repurpose 834 drug targets which have been investigated for IMD treatment. Methods: Utilizing two-sample MR, we first established causal relationships between major peripheral immune cell types and 14 IMD. Robust associations, particularly with eosinophils, were confirmed across diseases such as asthma, eczema, sinusitis, and rheumatoid arthritis, revealing 59 high-confidence relationships. Intragenic variants associated with causal immune cell types were then extracted to create instruments for 371 existing IMD drug targets ("intermediate trait" MR). In parallel, we leveraged four large blood plasma protein QTL datasets to obtain complementary instruments for 361 targets ("pQTL" MR). Results: In the intermediate trait MR analysis, we identified 811 gene-IMD associations (p-value < 0.05), 169 of which were supported by strong colocalisation evidence (PPH4 ≥ 0.8). In the pQTL MR analysis, we similarly found 841 protein-IMD associations (p-value < 0.05), 83 of which were confirmed with colocalisation. Comparison with a list of approved drugs indicated low sensitivities across disease outcomes for both exposure types (intermediate trait MR: 0.49 ± 0.23 SD, pQTL MR: 0.28 ± 0.12 SD). Conclusions: Drug targets identified in the pQTL and intermediate trait MR analyses show limited overlap (13%), presenting a comprehensive source of drug repurposing opportunities when the two approaches are combined.

17

Insight into risk associated phenotypes behind COVID-19 from phenotype genome-wide association studies

Chen, Z.; Matsuda, K.

2023-05-14 public and global health 10.1101/2023.05.09.23289706 medRxiv

Top 0.1%

12.7%

Long COVID is a complex, multi-systemic disease that poses a significant global public health challenge. Symptoms can vary widely, ranging from asymptomatic to severe, making the condition challenging to diagnose and manage effectively. Furthermore, identifying appropriate phenotypes in genome-wide association studies of COVID-19 remains unresolved. This study aimed to address these challenges by analyzing 220 deep-phenotype genome-wide association data sets (159 diseases, 38 biomarkers, and 23 medication usage traits) from BioBank Japan (BBJ) (n=179,000), UK Biobank, and FinnGen (n=628,000) to investigate pleiotropic effects of known COVID-19 risk-associated single nucleotide variants. Our findings reveal 32 different phenotypes that share common genetic risk factors with COVID-19 (p < 7.6×10⁻¹¹), including two diseases (myocardial infarction and type 2 diabetes), 26 biomarkers in seven categories (blood cell, metabolic, liver-related, kidney-related, protein, inflammatory, and anthropometric), and four medications (antithrombotic agents, HMG CoA reductase inhibitors, thyroid preparations, and anilides). As long COVID continues to coexist with humans, our results highlight the need for targeted screening to support specific vulnerable populations and to improve disease prevention and healthcare delivery.

18

Ethnicity-specific transcriptomic variation in immune cells and correlation with disease activity in systemic lupus erythematosus

Andreoletti, G.; Lanata, C. M.; Paranjpe, I.; Jain, T. S.; Nititham, J.; Taylor, K. E.; Combes, A. J.; Maliskova, L.; Jimmie Ye, C.; Katz, P.; Dall Era, M.; Yazdany, J.; Criswell, L. A.; Sirota, M.

2020-11-01 bioinformatics 10.1101/2020.10.30.362715 medRxiv

Top 0.1%

12.7%

Systemic lupus erythematosus (SLE) is a heterogeneous autoimmune disease in which outcomes vary among different racial groups. The aim of this study is to leverage large-scale transcriptomic data from diverse populations to better sub-classify SLE patients into more clinically actionable groups. We leverage cell-sorted RNA-seq data (CD14+ monocytes, B cells, CD4+ T cells, and NK cells) from 120 SLE patients (63 Asian and 57 White individuals) and apply a four-tier analytical approach to identify SLE subgroups within this multiethnic cohort: unsupervised clustering, differential expression analyses, gene co-expression analyses, and machine learning. K-means clustering on the individual cell type data resulted in three clusters for CD4+ T cells and CD14+ monocytes, and two clusters for B cells and NK cells. Correlation analysis revealed significant positive associations between the transcriptomic clusters of each immune cell type and clinical parameters, including disease activity and ethnicity. We then explored differentially expressed genes between the Asian and White groups for each cell type. The differentially expressed genes shared across the four cell types were involved in SLE or other autoimmune-related pathways. Co-expression analysis identified similarly regulated genes across samples and grouped these genes into modules. Samples were grouped into White-high and Asian-high (high disease activity, defined by a SLEDAI score ≥ 6) and White-low and Asian-low (SLEDAI < 6). Random forest classification of disease activity in the White and Asian cohorts showed the best classification in CD4+ T cells in White individuals. The results from these analyses will help stratify patients based on their gene expression signatures to enable precision medicine for SLE.

19

Region- and variance-based DNA methylation analyses reveal novel disease genes and pathways for systemic lupus erythematosus

Guo, M.; Wang, T.-Y.; Shen, J. J.; Wang, Y.-F.; Lau, Y.-L.; Yang, W.

2022-11-24 bioinformatics 10.1101/2022.11.23.516835 medRxiv

Top 0.1%

12.7%

Background: Systemic lupus erythematosus (SLE) is a prototype autoimmune disease with unclear pathogenesis. DNA methylation is an important regulatory mechanism of gene expression, providing a key angle from which to understand disease mechanisms. To understand the pathways involved in SLE, and to develop biomarkers for its diagnosis and treatment, we analyzed DNA methylation profiles of blood cells from SLE patients and healthy controls. Results: Most differentially methylated regions (DMRs) were identified in T cells, while the majority of differentially variable sites (DVSs) were found in B cells, featuring hypervariability in enhancers. We observed a prominent T cell receptor (TCR) signaling cluster with consistent hypermethylation and a B cell receptor (BCR) cluster with highly increased variability in SLE. Genes involved in innate immunity were often found to be hypomethylated, while adaptive immunity genes were characterized by hypermethylation. Using a machine learning approach, we identified 60 genes that accurately distinguished SLE patients from healthy individuals and that also showed correlation with disease activity. Conclusions: Through novel analyses of methylation data, this study highlights the role of lymphocyte receptor aberrations in the disease and identifies a list of genes that show great potential as biomarkers and shed new light on disease mechanisms.

20

Single-cell Landscape of Immune Cells in Blood and Skin in Psoriasis

Deng, J.; Nordkamp, M. O.; Ye, S.; Ye, J.; Balak, D.; Yu, W.; Radstake, T.; Borghans, J. A. M.; Lu, C.; Pandit, A.; Gerritsen, B.

2024-09-21 systems biology 10.1101/2024.09.17.613463 medRxiv

Top 0.1%

12.6%

Background: Psoriasis is a systemic inflammatory disease for which there is currently no cure, in part due to an incomplete understanding of its pathophysiology. Methods: To better understand the immune response in psoriasis, we performed single-cell RNA sequencing (scRNA-seq) on peripheral blood mononuclear cells (PBMCs) and on lesional and non-lesional skin samples from a cohort of 11 psoriasis patients and 8 healthy controls. Additionally, we conducted flow cytometry on PBMCs from a separate cohort of 13 psoriasis patients and 11 ankylosing spondylitis patients. Findings: Our study revealed altered immune signatures of specific myeloid and lymphocyte subsets in blood and skin, both in terms of cell numbers and gene expression. Specifically, we discovered elevated proportions of circulating CD14++ monocytes, increased expression of major histocompatibility complex (MHC) class II molecules by circulating CD16+ monocytes, and increased expression of genes related to skin homing and to pro-inflammatory responses by circulating plasmacytoid dendritic cells (pDCs) in psoriasis. Circulating CD8+ T effector memory cells in psoriasis patients exhibited reduced abundance but increased skin-homing potential. In psoriatic lesions, we observed a hyperinflammatory myeloid-cell state and enrichment of IL17-producing cells with a tissue-resident memory T-cell signature. Interpretation: The changes in immune cell numbers and gene expression indicate a significant alteration in the immune landscape of psoriasis patients. This suggests that the immune system in psoriasis is reprogrammed, affecting both innate and adaptive branches. These findings provide new insights into the aberrant immune-cell signatures in the circulation and skin lesions in psoriasis, and thereby help to understand its pathophysiology.
Funding: This study was financially supported by the National Natural Science Foundation of China (U23A6012), the Science and Technology Planning Project of Guangzhou (2024A03J0055, 202206080005), and the Innovation Team and Talents Cultivation Program of the National Administration of Traditional Chinese Medicine (ZYYCXTD-C-202204).